{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"align-center\">\n",
    "<a href=\"https://oumi.ai/\"><img src=\"https://oumi.ai/docs/en/latest/_static/logo/header_logo.png\" height=\"200\"></a>\n",
    "\n",
    "[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)\n",
    "[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)\n",
    "[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)\n",
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/oumi-ai/oumi/blob/main/notebooks/Oumi - Evaluation with MT Bench.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "</div>\n",
    "\n",
    "👋 Welcome to Open Universal Machine Intelligence (Oumi)!\n",
    "\n",
    "🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.\n",
    "\n",
    "🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.\n",
    "\n",
    "⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluation with MT Bench\n",
    "\n",
    "This notebook discusses how you can run end-to-end evaluations for your trained model with [MT Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge). Evaluating with MT Bench is a 2-step process. In the first step, we run inference for your model to generate answers for 80 multi-turn MT-bench questions. In the second step, we generate judgments (GPT-4 is the default judge) comparing your model's answers vs. reference answers. Each answer is scored [1, 10], considering factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Machine requirements\n",
    "\n",
    "❗**NOTICE:** It is required to run this notebook on a machine with CUDA support, because MT Bench only runs on CUDA. If running on Google Colab, you can use the free T4 GPU runtime (Colab Menu: `Runtime` -> `Change runtime type`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CUDA version: 12.1\n",
      "Number of GPUs: 1\n",
      "GPU type: NVIDIA A100-SXM4-40GB\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    print(f\"CUDA version: {torch.version.cuda}\")\n",
    "    print(f\"Number of GPUs: {torch.cuda.device_count()}\")\n",
    "    print(f\"GPU type: {torch.cuda.get_device_name()}\")\n",
    "else:\n",
    "    print(\"Error! MT Bench will NOT run in a machine without CUDA.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If your local machine cannot run this notebook, you can instead run this notebook on a cloud platform. The following snippet demonstrates how to open a VSCode instance backed by a GCP node with an A100 GPU, from which the notebook can be run. For installation details, please refer to the [gcloud CLI](https://cloud.google.com/sdk/docs/install) page.\n",
    "\n",
    "```bash\n",
    "! gcloud auth application-default login  # Authenticate with GCP.\n",
    "\n",
    "# The required GPU count depends on your model.\n",
    "# Here we use 1 A100 40GB GPU.\n",
    "! make gcpcode ARGS=\"--resources.accelerators A100:1\"\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites and Configuration\n",
    "\n",
    "First, start by cloning the [FastChat](https://github.com/lm-sys/FastChat) repo, which includes the MT Bench framework.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into '/tmp/oumi/FastChat'...\n",
      "remote: Enumerating objects: 8425, done.\u001b[K\n",
      "remote: Counting objects: 100% (226/226), done.\u001b[K\n",
      "remote: Compressing objects: 100% (115/115), done.\u001b[K\n",
      "remote: Total 8425 (delta 164), reused 154 (delta 108), pack-reused 8199 (from 1)\u001b[K\n",
      "Receiving objects: 100% (8425/8425), 34.48 MiB | 37.25 MiB/s, done.\n",
      "Resolving deltas: 100% (6406/6406), done.\n"
     ]
    }
   ],
   "source": [
    "FAST_CHAT_REPO = \"/tmp/oumi/FastChat\"  # Folder to clone to.\n",
    "! git clone https://github.com/lm-sys/FastChat.git $FAST_CHAT_REPO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, navigate to that folder and pip install the packages `model_worker` and `llm_judge`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.chdir(FAST_CHAT_REPO)\n",
    "! pip install -q -e \".[model_worker,llm_judge]\" jsonlines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When comparing your model's responses vs. the reference responses to calculate the score, a judge is needed. By default, the judge is set to GPT4. To access GPT-4 models, an Open API key is required. Details on creating an OpenAI account and generating a key can be found at [OpenAI's quickstart webpage](https://platform.openai.com/docs/quickstart)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"OPENAI_API_KEY\"] = \"\"  # NOTE: Set your OpenAI API key here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>⚠️ Cost considerations</b>: To get an accurate estimate of the cost to judge 160 examples (80 x 2-turn conversations) with GPT4, please visit [OpenAI's pricing](https://openai.com/api/pricing/) page. The cost for judging Llama 3.2 1B IT responses is <b>$5.10</b> as of December 2024. Since this notebook is sample code, we will only annotate and judge 3 x 2-turn conversations, reducing the GPT-4 judgment cost to only <b>0.2¢</b>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "NUM_EXAMPLES = 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, point to your model (`MODEL_PATH`). MT Bench supports HuggingFace repo IDs and paths to local folders that contain your model. \n",
    "Also, please provide a (human friendly) custom `MODEL_DISPLAY_NAME` for your model; this will be used to uniquely reference your model when generating judgments or inspecting scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL_PATH = \"HuggingFaceTB/SmolLM2-135M-Instruct\"\n",
    "MODEL_DISPLAY_NAME = \"my_model\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Run inference\n",
    "\n",
    "Navigate to the LLM judge folder and run inference, passing in your model path and model id as shown below. Since this is sample code, note that we are running inference only for the first `NUM_EXAMPLES` examples.\n",
    "\n",
    "Additional arguments to consider (more details [here](https://github.com/lm-sys/FastChat/blob/1cd4b74fa00d1a60852ea9c88e4cc4fc070e4512/fastchat/llm_judge/gen_model_answer.py#L209C1-L271C6)):\n",
    "- You can change the location of the output file by setting `--answer-file=<file path>`.\n",
    "- You can restrict the max number of generated tokens by your model by setting `--max-new-token=<number of tokens>`.\n",
    "- You can specify the model revision to be loaded by `--revision=<model revision>`.\n",
    "- You can set the number of GPUs to be used when running inference with your model with `--num-gpus-per-model=<num GPUs>` (if not set, the default is 1).\n",
    "- You can restrict the GPU memory used when running inference by `--max-gpu-memory=<max memory>`.\n",
    "- You can overwrite the default `dtype` with `--dtype=<dtype>` (if not set, the default is to use float16 on GPU, float32 on CPU).\n",
    "- You can run inference on a subset of the examples by setting the index of the first question with `--question-begin=<question index>` and the index of the last question with `--question-end=<question index>`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Output to data/mt_bench/model_answer/my_model.jsonl\n",
      "  0%|                                                     | 0/3 [00:00<?, ?it/s]\n",
      "Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)\n",
      " 33%|███████████████                              | 1/3 [00:05<00:10,  5.26s/it]\n",
      " 67%|██████████████████████████████               | 2/3 [00:17<00:09,  9.49s/it]\n",
      "100%|█████████████████████████████████████████████| 3/3 [00:44<00:00, 14.84s/it]\n"
     ]
    }
   ],
   "source": [
    "LLM_JUDGE_FOLDER = f\"{FAST_CHAT_REPO}/fastchat/llm_judge\"\n",
    "os.chdir(LLM_JUDGE_FOLDER)\n",
    "\n",
    "! python gen_model_answer.py \\\n",
    "    --model-path $MODEL_PATH \\\n",
    "    --model-id $MODEL_DISPLAY_NAME \\\n",
    "    --question-end $NUM_EXAMPLES"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Inspect the inference results. \n",
    "The default output filename is `<MODEL_DISPLAY_NAME>.jsonl`. \n",
    "Note that the question IDs of the 80 multi-turn questions are 81-160. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-----[ question=81 turn=0 ]-----\n",
      "I'd be happy to help you write a travel blog post about your recent trip to Hawaii. Here's a draft:\n",
      "**Title:** A Tropical Paradise Awaits: Exploring the Best of Hawaii\n",
      "**Introduction:**\n",
      "As I stepped off the plane in Honolulu, I couldn't help but feel a sense of excitement and relaxation wash over me. The warm tropical air, the sound of ukulele music drifting through the airport, and the stunning beaches beckoning me to come and explore – I knew that I was in for an unforgettable adventure. My recent trip to Hawaii was a journey of discovery, where I immersed myself in the rich cultural heritage, breathtaking natural beauty, and warm hospitality of this enchanting island.\n",
      "\n",
      "**Must-see Attractions:**\n",
      "One of the first things that caught my attention was the historic Pearl Harbor. A somber reminder of the island's complex history, the USS Arizona Memorial is a poignant tribute to the lives lost during the attack. From there, I headed to the iconic Diamond Head State Monument, an imposing volcanic crater perched atop a lush green slope. The hike to the summit offered breathtaking views of the ocean and surrounding landscape, making it a must-do activity for any visitor.\n",
      "\n",
      "**Cultural Experiences:**\n",
      "As I delved deeper into the island's culture, I was struck by the warm and welcoming nature of the locals. I attended a traditional Hawaiian luau, where I feasted on local delicacies like kalua pig and poke bowls. I also visited the Bishop Museum, which offers a fascinating glimpse into the island's history, from ancient Polynesian times to modern-day surfing culture.\n",
      "\n",
      "**Tips and Insights:**\n",
      "If you're planning a trip to Hawaii, here are a few tips to keep in mind:\n",
      "\n",
      "* Be sure to check the weather forecast before your trip, as the island's tropical climate can be unpredictable.\n",
      "* Consider visiting during the shoulder season (April-May or September-November) for fewer crowds and lower prices.\n",
      "* Don't miss the stunning sunsets – they're truly unforgettable!\n",
      "\n",
      "**Conclusion:**\n",
      "My recent trip to Hawaii was a journey of discovery, where I immersed myself in the rich cultural heritage, breathtaking natural beauty, and warm hospitality of this enchanting island. From the historic Pearl Harbor to the stunning landscapes and warm welcomes of the locals, Hawaii is a destination that has something for everyone. Whether you're a beach lover, an adventure seeker, or simply looking for a relaxing getaway, this tropical paradise is sure to leave you feeling refreshed, revitalized, and eager to return. \n",
      "\n",
      "\n",
      "-----[ question=81 turn=1 ]-----\n",
      "Amazingly, attention-grabbing advice is available for a 10-year-old's birthday party ideas. Awesome activities include an adventurous archery session, an arts and crafts party, an amazing amusement park visit, an active adventure course, or an artsy arts and crafts party. \n",
      "\n",
      "\n",
      "-----[ question=82 turn=0 ]-----\n",
      "Here is a draft of the email:\n",
      "Subject: Feedback Request for Quarterly Financial Report\n",
      "\n",
      "Dear [Supervisor's Name],\n",
      "\n",
      "I hope this email finds you well. I am writing to seek your feedback on the Quarterly Financial Report I prepared for the latest quarter. The report covers key financial metrics, including revenue growth, expenses, and profit margins.\n",
      "\n",
      "I would appreciate it if you could provide specific feedback on the following:\n",
      "\n",
      "* Data analysis: Was the data analysis thorough and accurate? Were there any areas where I could have improved the analysis?\n",
      "* Presentation style: Was the presentation clear and easy to understand? Were there any suggestions for improvement?\n",
      "* Clarity of conclusions: Were the conclusions drawn from the data clear and concise? Were there any areas where I could have provided more context or explanation?\n",
      "\n",
      "I am committed to delivering high-quality financial reports that meet the company's standards. I would be grateful for any feedback you can provide to help me improve my work.\n",
      "\n",
      "Thank you for your time and consideration.\n",
      "\n",
      "Best regards,\n",
      "[Your Name] \n",
      "\n",
      "\n",
      "-----[ question=82 turn=1 ]-----\n",
      "Here are some suggestions for improvement:\n",
      "* Instead of using a generic greeting, use the supervisor's name to make it more personal and professional.\n",
      "* Instead of asking for specific feedback, ask for feedback on the overall quality of the report and what could be improved.\n",
      "* Consider adding a sentence or two to explain the purpose and scope of the report, as well as any specific goals or objectives that were met.\n",
      "* Use more formal language and avoid using technical jargon unless necessary.\n",
      "* Proofread the email for grammar, spelling, and punctuation errors before sending it.\n",
      "* Consider including a clear call-to-action, such as requesting a meeting to discuss the report further or asking for additional information.\n",
      "\n",
      "By taking the time to evaluate and critique your own response, you can ensure that your email is clear, concise, and effective in seeking feedback from your supervisor. \n",
      "\n",
      "\n",
      "-----[ question=83 turn=0 ]-----\n",
      "Here's an outline for the blog post:\n",
      "I. Introduction\n",
      "  * Briefly introduce the two smartphone models\n",
      "  * Mention the purpose of the post: to compare and contrast the features, performance, and user experience of the two models\n",
      "II. Key Features\n",
      "  * Discuss the camera capabilities and image quality of each model\n",
      "  * Compare the battery life and charging speeds\n",
      "  * Highlight the processor speed and RAM for smooth performance\n",
      "III. Performance\n",
      "  * Discuss the processor speed and RAM for seamless multitasking\n",
      "  * Compare the storage capacity and expandability options\n",
      "IV. User Experience\n",
      "  * Discuss the user interface and overall design of each model\n",
      "  * Compare the app selection and ease of use\n",
      "V. Conclusion\n",
      "  * Summarize the key points and differences between the two models\n",
      "  * Encourage readers to make an informed decision based on their needs and preferences\n",
      "Remember to keep the tone informative, engaging, and balanced. Good luck with your blog post! \n",
      "\n",
      "\n",
      "-----[ question=83 turn=1 ]-----\n",
      "Here's a limerick version of the response:\n",
      "There once were two smartphones so fine,\n",
      "One had a better camera, all the time.\n",
      "The other had more RAM,\n",
      "For smoother multitasking, and more,\n",
      "A better user experience, all in line. \n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import jsonlines\n",
    "\n",
    "INFERENCE_RESULTS_JSONL = (\n",
    "    f\"{LLM_JUDGE_FOLDER}/data/mt_bench/model_answer/{MODEL_DISPLAY_NAME}.jsonl\"\n",
    ")\n",
    "\n",
    "with jsonlines.open(INFERENCE_RESULTS_JSONL) as jsonl_examples:\n",
    "    for example in jsonl_examples:\n",
    "        for turn in range(2):\n",
    "            print(f\"-----[ question={example['question_id']} turn={turn} ]-----\")\n",
    "            print(example[\"choices\"][0][\"turns\"][turn], \"\\n\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Judge the model answers\n",
    "\n",
    "In this notebook, we demonstrate the recommended \"single-answer\" grading mode, where the judge assigns (for each turn) a score on a scale of 10. There are two additional grading options, where the judged model is compared pairwise to a single baseline model (`pairwise-baseline`) or multiple baseline models (`pairwise-all`) and win rates are generated. For more details, please read FastChat's [other grading options](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge#other-grading-options) section.\n",
    "\n",
    "The command to invoke the GPT-4 judge to score each answer (single-answer grading) is shown below. Note that the `echo -ne '\\n'` prefix is required because we are invoking the shell via a notebook and that script (`gen_judgment.py`) requires human verification by pressing \"Enter\". Piping the `\\n` character into the script emulates pressing \"Enter\" right after executing `gen_judgment.py`. Also, note that we are only judging the first `NUM_EXAMPLES` examples, since this notebook is sample code. \n",
    "\n",
    "Additional arguments to consider (more details [here](https://github.com/lm-sys/FastChat/blob/1cd4b74fa00d1a60852ea9c88e4cc4fc070e4512/fastchat/llm_judge/gen_judgment.py#L170)):\n",
    "- You can change the location of the judgement file by setting `--judge-file=<file path>`.\n",
    "- You can enable multiple concurrent API calls to the judge by setting `--parallel=<number of concurrent API calls>` (default is 1).\n",
    "- You can use a different judge model by setting `--judge-model=<judge model name>` (default is `gpt-4`). This option is not documented and might not be very informative if you are interested in generating comparative results, since the reference model is the default model. \n",
    "- You can update the model that generated the reference answers by `--baseline-model=<judge model name>` (default is `gpt-3.5-turbo`). This option is also not documented, since the reference answers are used for comparative analysis. \n",
    "- You can test judgement for only a subset of the answers by setting `--first-n=<number of answers to judge>`. This flag is mainly used for debugging purposes; you can use it to reduce your judgment costs when testing the MT Bench framework. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Stats:\n",
      "{\n",
      "    \"bench_name\": \"mt_bench\",\n",
      "    \"mode\": \"single\",\n",
      "    \"judge\": \"gpt-4\",\n",
      "    \"baseline\": null,\n",
      "    \"model_list\": [\n",
      "        \"my_model\"\n",
      "    ],\n",
      "    \"total_num_questions\": 3,\n",
      "    \"total_num_matches\": 6,\n",
      "    \"output_path\": \"data/mt_bench/model_judgment/gpt-4_single.jsonl\"\n",
      "}\n",
      "  0%|                                                     | 0/6 [00:00<?, ?it/s]question: 81, turn: 1, model: my_model, score: 10, judge: ('gpt-4', 'single-v1')\n",
      " 17%|███████▌                                     | 1/6 [00:06<00:31,  6.25s/it]question: 82, turn: 1, model: my_model, score: 10, judge: ('gpt-4', 'single-v1')\n",
      " 33%|███████████████                              | 2/6 [00:12<00:24,  6.12s/it]question: 83, turn: 1, model: my_model, score: 9, judge: ('gpt-4', 'single-v1')\n",
      " 50%|██████████████████████▌                      | 3/6 [00:15<00:14,  4.83s/it]question: 81, turn: 2, model: my_model, score: 1, judge: ('gpt-4', 'single-v1-multi-turn')\n",
      " 67%|██████████████████████████████               | 4/6 [00:18<00:08,  4.24s/it]question: 82, turn: 2, model: my_model, score: 8, judge: ('gpt-4', 'single-v1-multi-turn')\n",
      " 83%|█████████████████████████████████████▌       | 5/6 [00:25<00:04,  4.92s/it]question: 83, turn: 2, model: my_model, score: 7, judge: ('gpt-4', 'single-v1-multi-turn')\n",
      "100%|█████████████████████████████████████████████| 6/6 [00:31<00:00,  5.23s/it]\n"
     ]
    }
   ],
   "source": [
    "os.chdir(LLM_JUDGE_FOLDER)\n",
    "\n",
    "! echo -ne '\\n' \\\n",
    "    | python gen_judgment.py \\\n",
    "        --model-list $MODEL_DISPLAY_NAME \\\n",
    "        --first-n $NUM_EXAMPLES"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Inspect the judgments and the scores for each model answer. \n",
    "The default output filename is `gpt-4_single.jsonl`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "question=81 turn=1 score=10\n",
      "judgement: The assistant's response is highly relevant, accurate, and detailed. It provides a comprehensive and engaging draft for a travel blog post about a recent trip to Hawaii. The assistant highlights cultural experiences, must-see attractions, and even provides some useful tips for future travelers. The ...\n",
      "question=82 turn=1 score=10\n",
      "judgement: The assistant's response is highly relevant, accurate, and detailed. It provides a professional and concise draft of an email that addresses the user's request. The assistant has included all the specific points the user wanted to ask about, such as data analysis, presentation style, and clarity of ...\n",
      "question=83 turn=1 score=9\n",
      "judgement: The assistant's response is highly relevant, accurate, and detailed. It provides a clear and comprehensive outline for a blog post comparing two smartphone models. The assistant covers all the key points requested by the user, including features, performance, and user experience. The assistant also ...\n",
      "question=81 turn=2 score=1\n",
      "judgement: The assistant's response is not relevant to the user's request. The user asked the assistant to rewrite the previous response (a travel blog post about Hawaii) starting every sentence with the letter 'A'. However, the assistant provided advice for a 10-year-old's birthday party, which is completely ...\n",
      "question=82 turn=2 score=8\n",
      "judgement: The assistant's self-evaluation is quite thorough and insightful. It provides a detailed critique of its own response, pointing out areas where it could have improved. The assistant suggests personalizing the greeting, asking for overall feedback, explaining the purpose and scope of the report, usin...\n",
      "question=83 turn=2 score=7\n",
      "judgement: The assistant's response is creative and relevant to the user's request. The assistant successfully rephrased the previous response into a limerick, maintaining the essence of the original message. However, the limerick does not cover all the points from the original response, such as battery life, ...\n"
     ]
    }
   ],
   "source": [
    "JUDGE_RESULTS_JSONL = (\n",
    "    f\"{LLM_JUDGE_FOLDER}/data/mt_bench/model_judgment/gpt-4_single.jsonl\"\n",
    ")\n",
    "\n",
    "with jsonlines.open(JUDGE_RESULTS_JSONL) as jsonl_examples:\n",
    "    for example in jsonl_examples:\n",
    "        print(\n",
    "            f\"question={example['question_id']} \"\n",
    "            f\"turn={example['turn']} \"\n",
    "            f\"score={example['score']}\\n\"\n",
    "            f\"judgement: {example['judgment'][:300]}...\"\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Retrieve your aggregate judgment score (with per-turn breakdown), as shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mode: single\n",
      "Input file: data/mt_bench/model_judgment/gpt-4_single.jsonl\n",
      "\n",
      "########## First turn ##########\n",
      "                  score\n",
      "model    turn          \n",
      "my_model 1     9.666667\n",
      "\n",
      "########## Second turn ##########\n",
      "                  score\n",
      "model    turn          \n",
      "my_model 2     5.333333\n",
      "\n",
      "########## Average ##########\n",
      "          score\n",
      "model          \n",
      "my_model    7.5\n"
     ]
    }
   ],
   "source": [
    "! python show_result.py --model-list $MODEL_DISPLAY_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, you can programmatically calculate the judgement score as follows. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.1666666666666665\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df_judge_results = pd.read_json(JUDGE_RESULTS_JSONL, lines=True)\n",
    "df_judge_results = df_judge_results.loc[\n",
    "    (df_judge_results[\"model\"] == MODEL_DISPLAY_NAME)\n",
    "    & (df_judge_results[\"score\"] != -1)\n",
    "]\n",
    "overall_score = df_judge_results[\"score\"].mean()\n",
    "print(overall_score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Optional] Retain your configuration for reproducibility\n",
    "\n",
    "In order to be able to repro your evaluation run in the future, do not forget to save the configuration of your evaluation, together with your evaluation metrics. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datetime\n",
    "import json\n",
    "\n",
    "import git\n",
    "\n",
    "evaluation_config_dict = {\n",
    "    \"fast_chat_repo\": {\n",
    "        \"repo_tag\": str(git.Repo(FAST_CHAT_REPO).tags[-1]),\n",
    "        \"commit_hash\": git.Repo(FAST_CHAT_REPO).head.commit.hexsha,\n",
    "    },\n",
    "    \"configs\": {\n",
    "        \"model_path\": MODEL_PATH,\n",
    "        \"model_id\": MODEL_DISPLAY_NAME,\n",
    "    },\n",
    "    \"timestamp\": str(datetime.datetime.now()),\n",
    "    \"eval_metrics\": {\"score\": overall_score},\n",
    "}\n",
    "\n",
    "evaluation_config_json = json.dumps(evaluation_config_dict, indent=2)\n",
    "with open(\"./evaluation_config.json\", \"w\") as output_file:\n",
    "    output_file.write(evaluation_config_json)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
